Hockey Analytics

By Raymond Wong, Mark Dodd, Dustin Tang

Introduction

As frivolous as sports may seem, there is no denying the passion and love that sport elicits in people. It is a passion that unites people in celebration and agony, bringing them together in joy and sorrow. This passion has created an entire sub-industry: sports analytics.

Sports are a results-based entertainment industry in which a winner is ultimately, and quantifiably, established, and this generates a wealth of data to be analyzed. This analysis is not fully embraced by everyone who discusses, manages, or plays the games, but that doesn’t mean it hasn’t had an impact. Decades-old “laws” regarding sports betting have been changed; how teams pick players has changed; how we watch and enjoy the game has changed. Sports data analysis was immortalized in the movie Moneyball, starring Brad Pitt, based on the true story of how the Oakland Athletics revolutionized baseball.

Our focus for this project is applying data analysis to professional hockey. We are interested in finding the drivers of a winning team in the NHL: which variables contribute to winning, and which trends appear in the data. The statistics we will look at relate to:

  • home ice advantage
  • power play advantage
  • shots (number and location).

Investigating these factors can help teams focus on changing the way they play the game and can support better models for predicting game outcomes, furthering both the analysis and the passion.

Data

Our data is a public Kaggle dataset called NHL Game Data (see references). The dataset was created using the NHL API, which is documented at https://gitlab.com/dword4/nhlapi. In addition to the Kaggle dataset, we polled the API directly to gather specific information we required for our analysis.

The Kaggle NHL dataset can be visualized with the following entity relationship diagram.

[Figure: entity relationship diagram of the Kaggle NHL dataset]

Guiding Questions

  1. “Do teams that win at home more often, win?”
  2. “It isn't how often special teams work... it is when they work”
  3. “When You Put the Puck on the Net, Good Things Happen”

1. “Do teams win more at home? And in what way?”

  • Purpose
  • Data Wrangling
  • Results

Purpose

We wanted to investigate:

  1. Home-ice advantage over the course of the last few seasons.

    Do better teams win more home games?

  2. In a supposed level-playing field, in what way do home teams gain an advantage?

Data Wrangling

  • Collected the results of every regular season game over the last 6 seasons, along with the home team
  • Grouped by team and computed the overall % Home Wins / Total Wins
  • Collected each team's overall ranking in each season
  • Wrangling: team name changes, expansion within the league
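
The grouping step above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names (home_team_id, away_team_id, home_goals, away_goals) and the values are made up for the example and may not match the project's actual wrangling code.

```python
import pandas as pd

# Toy stand-in for the game results table (hypothetical columns and values).
games_demo = pd.DataFrame({
    "home_team_id": [1, 1, 2, 2, 3, 3],
    "away_team_id": [2, 3, 1, 3, 1, 2],
    "home_goals":   [3, 2, 1, 4, 2, 5],
    "away_goals":   [1, 4, 2, 0, 3, 1],
})

# NHL games always end with a winner, so a strict comparison is enough.
home_won = games_demo["home_goals"] > games_demo["away_goals"]

# The winner is the home team where it won, otherwise the away team.
games_demo["winner"] = games_demo["home_team_id"].where(home_won, games_demo["away_team_id"])
games_demo["home_win"] = home_won

# Per team: total wins, wins on home ice, and the share of wins earned at home.
summary = games_demo.groupby("winner").agg(
    total_wins=("home_win", "size"),
    home_wins=("home_win", "sum"),
)
summary["home_win_share"] = summary["home_wins"] / summary["total_wins"]
print(summary)
```

A share well above 0.5 would suggest that a team collects a disproportionate part of its wins on home ice.
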
In [1]:
import pandas as pd
import numpy as np

from IPython.display import display
import ipywidgets as widgets
from ipywidgets import interact

import plotly.graph_objects as go
import plotly.offline as py
py.init_notebook_mode(connected=False)

3. “When You Put the Puck on the Net, Good Things Happen”

  • Purpose
  • Data Wrangling
  • Results

Purpose

We wanted to investigate where players shoot from and where goals come from.


Shot Location

Where a player shoots the puck from is an important part of strategy in hockey. There are fans and coaches who will always tell you to "Shooooooot".

But which locations offer the best chances of scoring?

Data Wrangling

For this visualization we focused on a single season of data, but it involved a significant amount of data wrangling. We had to perform the following:

  • Draw an NHL rink
  • Get all shots and goals and their locations
  • Merge with teams table to bring in team information
  • Merge with player event table to allow us to merge with the player table to import player data

Create the NHL Rink

The following code cell will create a function that returns a list of plotly shapes to use as one half of the NHL ice surface.

In the code we "flip" the shot and goal positions around center ice so that all of the data is concentrated on half the ice surface.

In [2]:
def draw_shape(shape_, p1, p2, width = 1, color = None, fill = None):

    shape = dict(
        type = shape_, xref = 'x', yref = 'y',
        x0 = str(p1[0]), y0 = str(p1[1]),
        x1 = str(p2[0]), y1 = str(p2[1]),
        line = dict(
            width = width
        ))

    if color is not None:
        shape['line']['color'] = color

    if fill is not None:
        shape['fillcolor'] = fill

    return shape



def draw_arc(m1, m2, c1, c2, c3, c4, c5, c6, width = 1, color = None):
    # first we convert arguments into the path string
    m = " ".join(["M", str(m1), str(m2)])
    c = " ".join(['C', str(c1), str(c2) + ',',
                       str(c3), str(c4) + ',',
                       str(c5), str(c6)])
    p = " ".join([m, c])
    shape = dict(
        type = 'path', xref = 'x', yref = 'y',
        path = p,
        line=dict(
            width = width
        ))

    # apply the optional line color
    if color is not None:
        shape['line']['color'] = color

    return shape

def draw_nhl_rink():

    # color constants to reduce code later
    _RED = 'rgba(255, 0, 0, 1)'
    _BLUE = 'rgba(0, 0, 255, 1)'
    _FACEOFF = 'rgba(10, 10, 100, 1)'

    # build a dictionary to store our rink shapes
    nhl_rink = {}
    nhl_rink["outer_rink"] = draw_shape('rect', (-250, 0), (250, 516.2))
    nhl_rink["outer_line"] = draw_shape('line', (200, 580), (-200, 580))
    nhl_rink["center_line"] = draw_shape('line', (-250, 0), (250, 0), color = _RED)
    nhl_rink["end_line"] = draw_shape('line', (-250, 516.2), (250, 516.2), color = _RED)
    nhl_rink['blue_line'] = draw_shape('rect', (250, 150.8), (-250, 156.8), color = _BLUE, fill = _BLUE)
    nhl_rink['center_dot'] = draw_shape('circle', (2.94, 2.8), (-2.94, -2.8), color = _BLUE, fill = _BLUE)
    nhl_rink['center_circle'] = draw_shape('circle', (88.2, 87), (-88.2, -87), color = _BLUE)
    nhl_rink['offside_dot1'] = draw_shape('circle', (135.5, 121.8), (123.5, 110.2), color = _RED, fill = _RED)
    nhl_rink['offside_dot2'] = draw_shape('circle', (-135.5, 121.8), (-123.5, 110.2), color = _RED, fill = _RED)
    nhl_rink['zone_dot1'] = draw_shape('circle', (135.5, 406), (123.5, 394.4), color = _RED, fill = _RED)
    nhl_rink['zone_dot2'] = draw_shape('circle', (-135.5, 406), (-123.5, 394.4), color = _RED, fill = _RED)
    nhl_rink['zone_circle1'] = draw_shape('circle', (217.6, 487.2), (41.2, 313.2), color = _RED)
    nhl_rink['zone_circle2'] = draw_shape('circle', (-217.6, 487.2), (-41.2, 313.2), color = _RED)
    nhl_rink['zone1_line1'] = draw_shape('line', (30.04, 416.4), (41.8, 416.4), color = _RED)
    nhl_rink['zone1_line2'] = draw_shape('line', (30.04, 384), (41.8, 384), color = _RED)
    nhl_rink['zone1_line3'] = draw_shape('line', (228.76, 416.4), (217, 416.4), color = _RED)
    nhl_rink['zone1_line4'] = draw_shape('line', (228.76, 384), (217, 384), color = _RED)
    nhl_rink['zone2_line1'] = draw_shape('line', (-30.04, 416.4), (-41.8, 416.4), color = _RED)
    nhl_rink['zone2_line2'] = draw_shape('line', (-30.04, 384), (-41.8, 384), color = _RED)
    nhl_rink['zone2_line3'] = draw_shape('line', (-228.76, 416.4), (-217, 416.4), color = _RED)
    nhl_rink['zone2_line4'] = draw_shape('line', (-228.76, 384), (-217, 384), color = _RED)
    nhl_rink['faceoff1_line1'] = draw_shape('line', (141.17, 423.4), (141.17, 377), color = _FACEOFF)
    nhl_rink['faceoff1_line2'] = draw_shape('line', (117.62, 423.4), (117.62, 377), color = _FACEOFF)
    nhl_rink['faceoff1_line3'] = draw_shape('line', (153, 406), (105.8, 406), color = _FACEOFF)
    nhl_rink['faceoff1_line4'] = draw_shape('line', (153, 394.4), (105.8, 394.4), color = _FACEOFF)
    nhl_rink['faceoff2_line1'] = draw_shape('line', (-141.17, 423.4), (-141.17, 377), color = _FACEOFF)
    nhl_rink['faceoff2_line2'] = draw_shape('line', (-117.62, 423.4), (-117.62, 377), color = _FACEOFF)
    nhl_rink['faceoff2_line3'] = draw_shape('line', (-153, 406), (-105.8, 406), color = _FACEOFF)
    nhl_rink['faceoff2_line4'] = draw_shape('line', (-153, 394.4), (-105.8, 394.4), color = _FACEOFF)
    nhl_rink['goal_line1'] = draw_shape('line', (64.7, 516.2), (82.3, 580))
    nhl_rink['goal_line2'] = draw_shape('line', (23.5, 516.2), (23.5, 493))
    nhl_rink['goal_line3'] = draw_shape('line', (-64.7, 516.2), (-82.3, 580))
    nhl_rink['goal_line4'] = draw_shape('line', (-23.5, 516.2), (-23.5, 493))
    nhl_rink['outer_arc1'] = draw_arc(200, 580, 217, 574, 247, 532, 250, 516.2)
    nhl_rink['outer_arc2'] = draw_arc(-200, 580, -217, 574, -247, 532, -250, 516.2)
    nhl_rink['goal_arc1'] = draw_arc(23.5, 493, 20, 480, -20, 480, -23.5, 493)
    nhl_rink['goal_arc2'] = draw_arc(17.6, 516.2, 15, 530, -15, 530, -17.6, 516.2)

    # convert rink shapes dictionary to a list of shapes to use with plotly
    rink_shapes = [nhl_rink[key] for key in nhl_rink]

    return rink_shapes
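
To make the arc helper more concrete, the sketch below rebuilds the same kind of SVG-style path string ("M" to move to the start point, "C" for a cubic Bezier curve) without any plotting library. The function name bezier_path is hypothetical and only used for illustration; the coordinates are those of the first rink corner above.

```python
def bezier_path(start, control1, control2, end):
    """Build an SVG-style cubic Bezier path string from four (x, y) points."""
    move = "M {} {}".format(*start)
    curve = "C {} {}, {} {}, {} {}".format(*control1, *control2, *end)
    return move + " " + curve

# Same numbers as the 'outer_arc1' corner of the rink.
corner_path = bezier_path((200, 580), (217, 574), (247, 532), (250, 516.2))

# A plain dict in the same layout as the shapes returned by the helpers above.
corner_shape = dict(type="path", xref="x", yref="y",
                    path=corner_path, line=dict(width=1))
print(corner_path)
```

The two middle points are control points: the curve starts at the first point, bends toward the controls, and ends at the last point, which is how the rounded rink corners are approximated.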

Data Wrangling - NHL Shooting and Scoring Data

We will focus on the 2018-2019 season (regular season and playoffs) and assume that the distributions are similar for other seasons.
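
Selecting a single season relies on the structure of the game id, which packs the first year of the season, the game type, and the game number into one integer (ssssttnnnn, as documented in the cleaning cell further down). The sketch below splits a few example ids using integer division and modulo; the ids are illustrative values, and the modulo form is an equivalent alternative to the subtraction used in the cleaning cell.

```python
import pandas as pd

# Example ids: first and last regular season game, plus one playoff game.
ids = pd.Series([2018020001, 2018021271, 2018030111], name="game_id")

parts = pd.DataFrame({
    "season": ids // 1_000_000,          # first four digits
    "game_type": (ids // 10_000) % 100,  # 2 = regular season, 3 = playoffs
    "game_num": ids % 10_000,            # last four digits
})
print(parts)
```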

In [3]:
# set up filenames

path = ""

teams = pd.read_csv(path + "team_info.csv")
game_plays = pd.read_csv(path + "game_plays_2018.csv")
games = pd.read_csv(path + "game.csv")
game_player = pd.read_csv(path + "game_plays_players_2018.csv")
player_info = pd.read_csv(path + "player_info.csv")

Let's take a quick look at the dataframes we will be working with for this portion of the project.

In [4]:
# display(game_plays.head(1))
# display(teams.head(1))
# display(games.head(1))
# display(game_player.head())
# display(player_info.head())

Data Wrangling - Cleaning

As mentioned in the introduction to this section, we needed to do a lot of cleaning to build a dataframe that is in the right form for the intended visualization. We will be using four tables that will need to be merged and cleaned.
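
The chain of merges can be pictured on a toy example before reading the full cell. The tables and values below are made up; only the column names (play_id, player_id, playerType, team_id_for, team_id, teamName) follow the ones used in this notebook.

```python
import pandas as pd

# Toy versions of the plays, play-player link, player, and team tables.
plays = pd.DataFrame({
    "play_id": ["p1", "p2", "p3"],
    "event": ["Shot", "Goal", "Faceoff"],
    "team_id_for": [20, 22, 20],
})
play_players = pd.DataFrame({
    "play_id": ["p1", "p1", "p2", "p2", "p3"],
    "player_id": [1, 9, 2, 1, 3],
    "playerType": ["Shooter", "Goalie", "Scorer", "Assist", "Winner"],
})
players = pd.DataFrame({
    "player_id": [1, 2, 3, 9],
    "lastName": ["Alpha", "Bravo", "Charlie", "Delta"],
})
teams_demo = pd.DataFrame({"team_id": [20, 22], "teamName": ["Flames", "Oilers"]})

# 1. keep only shot-type events
shots = plays[plays["event"].isin(["Shot", "Missed Shot", "Goal"])]
# 2. attach the player who took the shot (one row per play)
shooters = play_players[play_players["playerType"].isin(["Shooter", "Scorer"])]
merged = shots.merge(shooters[["play_id", "player_id"]], on="play_id")
# 3. bring in the player name
merged = merged.merge(players, on="player_id")
# 4. replace the team id with the team name
merged = merged.merge(teams_demo, left_on="team_id_for", right_on="team_id")
merged = merged.drop(columns=["team_id", "team_id_for"]).rename(columns={"teamName": "team_for"})
print(merged)
```

Filtering the link table to Shooter and Scorer rows before merging keeps one row per play; without that filter, goalies and assisting players would duplicate each shot.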

In [5]:
# filter for 2018-2019 season games (regular season and playoffs)
# game id is of format ssss-tt-nnnn where:
#     ssss = first year of season (ie. 2018 for 2018-2019 season)
#     tt = two digits for type of game (02 for regular season, 03 for playoffs)
#     nnnn = four digits for the game number as there are 31 * 82 / 2 = 1271 regular season games + playoffs
#
# also take this opportunity to just filter for shots and goals

plays_2018 = game_plays[(game_plays.game_id >= 2018000000) & (game_plays.game_id < 2019000000) &
                        (game_plays.event.isin(['Shot', 'Missed Shot', 'Goal']))
                       ].copy().reset_index()

# break up the game id into season, game type, and game number
plays_2018['season'] = plays_2018.game_id // 1000000
plays_2018['game_type'] = plays_2018.game_id // 10000 - plays_2018['season'] * 100
plays_2018['game_num'] = plays_2018.game_id - (plays_2018.season*100 + plays_2018.game_type) * 10000

# a small function to convert a team id into a team name and drop the id column from the dataframe
def id_to_team_name(df, teams, df_id, team_type):
    new = pd.merge(left = df,
                   right = teams[['team_id', 'teamName']],
                   left_on = [df_id],
                   right_on=['team_id'])
    new = new.drop(columns = ['team_id', df_id])
    new = new.rename(columns={'teamName': team_type})

    return new

# get the player id for the "for event"
p_filter = game_player.playerType.isin(['Shooter', 'Scorer'])
plays_2018 = pd.merge(left = plays_2018,
                      right = game_player[p_filter][['play_id', 'player_id']],
                      on = 'play_id')
# convert the player id into a player name
plays_2018 = pd.merge(left = plays_2018,
                      right = player_info[['player_id', 'firstName', 'lastName', 'primaryPosition']],
                      on = 'player_id')
plays_2018['fullName'] = plays_2018['lastName'] + ', ' + plays_2018['firstName']

# replace id's with team names
team_ids = ['team_id_for', 'team_id_against']
team_names = ['team_for', 'team_against']
for team_id, team_name in zip(team_ids, team_names):
    plays_2018 = id_to_team_name(plays_2018, teams, team_id, team_name)

#convert columns to categorical to make them more efficient
cat_cols = ['event', 'periodType', 'rink_side', 'primaryPosition']
for c in cat_cols:
    plays_2018[c] = plays_2018[c].astype('category')

# rescale the x, y coordinates into plotly coordinates
#   NHL data oriented so that X is the long direction, and Y is across the ice
#      -100 <= x <= 100
#      -42 <= y <= 42
#
#   the rink we build in plotly is oriented so that y is the long direction and x is across the ice
#      -250 <= x <= 250
#      0 <= y <= 580 (we are only doing half the ice)
#
# take the absolute value to flip everything to the same side
plays_2018['py_y'] = plays_2018['st_x'].abs() * 580 / 100
plays_2018['py_x'] = (-plays_2018['st_y'] + 42) * 500/84 - 250
# draw one random offset per row so that overlapping markers are spread apart
n_rows = len(plays_2018)
plays_2018['jitter_y'] = plays_2018['py_y'] + np.random.normal(0, 2/3, size = n_rows)
plays_2018['jitter_x'] = plays_2018['py_x'] + np.random.normal(0, 2/3, size = n_rows)

plays_2018 = plays_2018.drop(columns = ['x', 'y', 'st_x', 'st_y', 'player_id', 'play_id', 'game_id',
                                        'firstName', 'lastName', 'goals_away', 'goals_home',
                                        'secondaryType', 'play_num', 'periodTime', 'periodTimeRemaining',
                                        'dateTime', 'description', 'rink_side'])
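
The coordinate rescaling at the end of the cell above can be checked in isolation. The sketch below wraps the same two formulas in a small function (the name to_plotly_coords is hypothetical) and applies them to the corners and center of the NHL coordinate range.

```python
import numpy as np
import pandas as pd

def to_plotly_coords(st_x, st_y):
    """Map NHL coordinates (x along the ice, y across) onto the half-rink plot."""
    py_y = np.abs(st_x) * 580 / 100        # fold both ends onto one half
    py_x = (-st_y + 42) * 500 / 84 - 250   # rescale -42..42 to 250..-250
    return py_x, py_y

demo = pd.DataFrame({"st_x": [-100.0, 0.0, 100.0], "st_y": [42.0, 0.0, -42.0]})
demo["py_x"], demo["py_y"] = to_plotly_coords(demo["st_x"], demo["st_y"])
print(demo)
```

Both end boards (st_x of -100 and 100) land on py_y = 580, which is the "flip" onto a single half of the ice, while center ice maps to the origin of the plot.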

Plot the Visualization

We will build a heatmap for shots and goals from the above dataframe.
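
Conceptually, a 2D histogram contour bins the points into a grid and draws contours over the counts in each bin. As a rough stand-in for that binning step (not what plotly runs internally), the sketch below counts random points on the half-rink grid with NumPy; the points, the seed, and the bin counts are arbitrary choices for illustration.

```python
import numpy as np

# Random points spread over the half-rink coordinate range used in this notebook.
rng = np.random.default_rng(0)
x = rng.uniform(-250, 250, size=1000)
y = rng.uniform(0, 580, size=1000)

# Count how many points fall into each cell of a 10 x 12 grid.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=[10, 12], range=[[-250, 250], [0, 580]]
)
print(counts.shape, counts.sum())
```

With real shot data, the cells with the highest counts correspond to the "hot" regions of the heatmap.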

In [6]:
# empty dataframe of x and y coordinates to be used when we want the scatterplot to be empty
empty = pd.DataFrame({'py_x': [0], 'py_y': [0]})

# default heatmap trace
heatmap_trace = go.Histogram2dContour(
    x = empty['py_x'],
    y = empty['py_y'],
    hoverinfo = 'skip',
    name = 'density', ncontours = 3,
    colorscale = 'Hot', reversescale = True, showscale = False,
    contours = dict(coloring='heatmap'),
)

# scatterplot
shot_goal_trace = go.Scatter(
    x = empty['py_x'],
    y = empty['py_y'],
    mode = 'markers',
    name = 'goals',
    marker = dict(
        size = 6,
        color = 'blue'
    )
)

# build the layout for plotly
layout = go.Layout(
    title='NHL Rink',
    showlegend=True,
    xaxis=dict(
        showgrid=False,
        range=[-300, 300],
        showticklabels = False
    ),
    yaxis=dict(
        showgrid=False,
        range=[-100, 600],
        showticklabels = False
    ),
    shapes = draw_nhl_rink(),
    plot_bgcolor = 'rgba(0,0,0,0)',
    height = 700, #700
    width = 600 #600
)

# create a plotly widget
fig_game = go.FigureWidget(data = [heatmap_trace, shot_goal_trace], layout = layout)

# helper function to filter dataframe based on the state of the input widgets
def shot_goals_df(display_type, season_filter):
    goal_filter = plays_2018['event'] == 'Goal'
    shot_filter = ~goal_filter

    if display_type == '':
        df = empty
    elif display_type == 'Shots':
        df = plays_2018[season_filter & shot_filter]
    elif display_type == 'Goals':
        df = plays_2018[season_filter & goal_filter]
    return df

# primary eventhandler function
def update_heat(game_type, display_type, team = 'Flames', player = 'McDavid, Connor'):
    season_filter = plays_2018['game_type'] == game_type
    goal_filter = plays_2018['event'] == 'Goal'
    shot_filter = ~goal_filter

    if radio_mode.value == 'Season':
        df = shot_goals_df(display_type, season_filter)
        fig_game.data[0].x = df['py_x']
        fig_game.data[0].y = df['py_y']
        fig_game.data[1].x = empty['py_x']
        fig_game.data[1].y = empty['py_y']
    elif radio_mode.value == 'Game':
        # pick a random game that exists for the selected game type;
        # game numbers repeat across game types, so keep the game type filter
        game_nums = plays_2018.loc[season_filter, 'game_num'].unique()
        game_filter = season_filter & (plays_2018['game_num'] == np.random.choice(game_nums))
        df1 = plays_2018[game_filter & shot_filter]
        df2 = plays_2018[game_filter & goal_filter]
        fig_game.data[0].x = df1['py_x']
        fig_game.data[0].y = df1['py_y']
        fig_game.data[1].x = df2['jitter_x']
        fig_game.data[1].y = df2['jitter_y']
    elif radio_mode.value == 'Team':
        # force it to look at regular season so we don't need to check if a team was in the playoffs or not
        if display_type != '':
            season_filter = plays_2018['game_type'] == 2
            df = shot_goals_df(display_type, season_filter)
            df = df[df['team_for'] == team]
            fig_game.data[0].x = df['py_x']
            fig_game.data[0].y = df['py_y']
            fig_game.data[1].x = empty['py_x']
            fig_game.data[1].y = empty['py_y']
    elif radio_mode.value == 'Player':
        # force it to look at regular season so we don't need to check if a team was in the playoffs or not
        season_filter = plays_2018['game_type'] == 2
        df = plays_2018[season_filter & (plays_2018.fullName == player)]
        display(df)
        # build the goal mask on the subset itself so it matches the subset's index
        player_goals = df['event'] == 'Goal'
        fig_game.data[0].x = df[~player_goals]['py_x']
        fig_game.data[0].y = df[~player_goals]['py_y']
        fig_game.data[1].x = df[player_goals]['jitter_x']
        fig_game.data[1].y = df[player_goals]['jitter_y']

#############################################
# Create the widgets
#############################################
dropdown_goals = widgets.Dropdown(
    options = ["", "Goals", "Shots"],
    value = "",
    description = 'Display:',
)

# use a separate name so the teams dataframe loaded earlier is not overwritten
team_list = list(plays_2018.team_for.sort_values().unique())
team_list.insert(0, "")

dropdown_teams = widgets.Dropdown(
    options = team_list,
    value = "",
    description = 'Team:',
)

player_list = list(plays_2018.fullName.sort_values().unique())
player_list.insert(0, "")
dropdown_players = widgets.Dropdown(
    options = player_list,
    value = "",
    description = 'Player:',
)

radio_mode = widgets.RadioButtons(
    options = ['Game', 'Season', 'Team', 'Player'],
    description = 'Mode:',
    disabled = False
)

radio_playoff = widgets.RadioButtons(
    options = [('Regular', 2), ('Playoff', 3)],
    description = 'Game Type:',
    disabled = False
)

#############################################
# Define eventhandlers
#############################################
def dropdown_goals_eventhandler(change):
    update_heat(radio_playoff.value, change.new)

def dropdown_teams_eventhandler(change):
    update_heat(radio_playoff.value, dropdown_goals.value, change.new)

def dropdown_players_eventhandler(change):
    update_heat(radio_playoff.value, dropdown_goals.value, player = change.new)

def radio_playoff_eventhandler(change):
    update_heat(change.new, dropdown_goals.value)

def radio_mode_eventhandler(change):
    if change.new == 'Game':
        dropdown_goals.layout.visibility = 'hidden'
        dropdown_teams.layout.visibility = 'hidden'
        dropdown_players.layout.visibility = 'hidden'
        radio_playoff.layout.visibility = 'visible'
        update_heat(radio_playoff.value, dropdown_goals.value)
    elif change.new == 'Season':
        dropdown_goals.layout.visibility = 'visible'
        dropdown_teams.layout.visibility = 'hidden'
        dropdown_players.layout.visibility = 'hidden'
        radio_playoff.layout.visibility = 'visible'
    elif change.new == 'Team':
        dropdown_teams.layout.visibility = 'visible'
        dropdown_goals.layout.visibility = 'visible'
        dropdown_players.layout.visibility = 'hidden'
        radio_playoff.layout.visibility = 'hidden'
    elif change.new == 'Player':
        dropdown_teams.layout.visibility = 'hidden'
        dropdown_players.layout.visibility = 'visible'
        dropdown_goals.layout.visibility = 'hidden'
        radio_playoff.layout.visibility = 'hidden'

#############################################
# Register event handlers with widgets
#############################################
radio_mode.observe(radio_mode_eventhandler, names = 'value')
radio_playoff.observe(radio_playoff_eventhandler, names = 'value')
dropdown_goals.observe(dropdown_goals_eventhandler, names = 'value')
dropdown_teams.observe(dropdown_teams_eventhandler, names = 'value')
dropdown_players.observe(dropdown_players_eventhandler, names = 'value')
In [7]:
# display widgets
display(radio_mode)
display(radio_playoff)
display(dropdown_goals)
display(dropdown_teams)
display(dropdown_players)
fig_game

Conclusions

Overall, we learned a great deal about certain aspects of NHL hockey and certain types of advantages. We saw that home ice advantage is measurable and has a positive impact on results. Additionally, in our investigation of power plays we confirmed that a team's performance is related to how effective its power play is at key moments. Finally, it was interesting to see our intuitions about shooting and scoring borne out. It was especially interesting to visualize the individual playing styles of specific players.

The next steps for this project may include looking for additional statistics that provide insight into team performance and that teams could focus on improving. Additionally, we could modify the heat map so that teams could filter for their opposition to gain insight into how the other team plays and which defensive strategies would best counter them.

REFERENCES

Ellis, M. (2019, June). NHL Game Data, Version 4 [Online]. Available at: https://www.kaggle.com/martinellis/nhl-game-data (Retrieved September 26, 2019)

REFERENCES for Python and libraries used:

McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51-56. (pandas)

McKinney, W. (2017). Python for Data Analysis. Sebastopol: O'Reilly. (pandas)

Plotly Technologies Inc, (2015). Plotly Python Open Source Graphing Library [Online] Available at: https://plot.ly/python/ (Accessed: 10 October 2019) (Plotly)

Pravendra (2016) 'NHL Shots Analysis Using Plotly Shapes', modern data, 24 November. Available at: https://moderndata.plot.ly/nhl-shots-analysis-using-plotly-shapes/ (Accessed: 8 October 2019) (NHL Rink)

Project Jupyter Revision (2017). ipywidgets User Guide [Online] Available at: https://ipywidgets.readthedocs.io/en/latest/ (Accessed: 10 October 2019) (ipywidgets)